Atlassian Summit: Agile Incident Response and Resolution in the World of DevOps

DevOps has transformed how teams build and run software: you build it, you own it! In this new DevOps world, the way teams respond to and resolve incidents has changed dramatically. You no longer have a 24/7 NOC handling incident response; instead, developers and ops engineers must work closely together and go on-call to triage alerts and resolve incidents.

At the same time, with more applications and services moving to the cloud, the impact of major outages has increased. Most enterprises have multiple major incidents each month and lose over $100K per incident in revenue and productivity.

This session describes how to effectively manage the entire incident lifecycle end-to-end, helping teams lower the duration and frequency of outages. We will cover the following topics:

  • How to prepare and train your team up front to respond to incidents quickly.
  • How to triage large volumes of alerts, identify the severity of issues, and loop in the right people ASAP.
  • How to manage and run an incident effectively across multiple responders and stakeholders.
  • How to create a learning feedback loop in your incident lifecycle by conducting a blameless post-mortem.

Overall, we will cover people, process, and tools as they relate to the incident lifecycle. In terms of tooling, we will feature a best-of-breed toolchain from Atlassian, PagerDuty, and other modern, innovative companies.

Video

Slides